Introduction
In this project, we’re examining inspection scores for restaurants in and around Austin, Texas, from 2014 through early 2017. Scores are determined through routine inspections, usually conducted twice a year. If a score lower than 70 is recorded in a yearly inspection, the restaurant will be subjected to one or more followup visits to ensure a return to compliance. All of our restaurant inspection records include geographical information in the form of zip codes, allowing the use of additional datasets to further explore the data. Visit https://data.world/kvaughn/s-17-dv-project-6 to explore our datasets, or see below for the visualizations we’ve made of the data.
R Configuration
Below we display our sessionInfo().
Visualizations
Interesting Things
Chinatown
Generally speaking, low scores are outliers for restaurants. The inspector caught them on a bad day, perhaps. Look at the histogram for an indication of the proportion of passing scores (above 70) versus failing scores.
If you’re interested in seeing the whole collection of scores for a restaurant with at least one low score, use our Shiny boxplot and select the zipcode of interest. Some restaurants demonstrate a trend in their low scores that goes beyond “one bad day”. Chinatown, in 78731, is a good example of this. A score of 70 or higher is considered ‘passing,’ while a score below 70 triggers an automatic reinspection of the facility. Restaurants are encouraged to aim for scores approaching 100. Chinatown is apparently just aiming to pass. With an average score of 71, it’s meeting its goal.
The Limitations of our Dataset
We originally expected strong correlations between the number of restaurants in a given zip code and the population of that zip code, with an expectation that downtown would be an outlier (lots of restaurants, relatively few residents). When we graphed our data, we found that restaurant quantity and population were only weakly correlated, and we found a number of unexpected outliers.
Similarly, we expected that the number of transit stops would be strongly correlated with the number of restaurants in a given zipcode. This time, our data supported our expectations, with a moderately strong correlation of 0.756.
Curiosity then led us to plot population versus number of transit stops for a given zipcode; again, the two were only weakly correlated, with many of the same outliers.
This prompted a closer look at those outliers, and demonstrated a shortcoming in our dataset. It turns out that the outliers with large populations and few transit stops or restaurants are in the far suburbs of Austin. These areas are governed by other civic bodies, and thus the bulk of their restaurant inspections (and, potentially, transit) will be conducted by an entity other than the City of Austin. In order to get a cleaner view of the relationship between population, transit, and restaurant density, we would need to adjust our data to only include areas under the jurisdiction of the City of Austin. A good visualization of this distribution can be seen below.
Compare this to the map of the boundaries of the City of Austin, courtesy of MapTechnica:
Restaurant Density and Non-native Population
The United States Census provides information on the number of residents per zipcode that are native citizens, as well as how many residents are naturalized citizens or noncitizens. For the purposes of this visualization, we’ve grouped “naturalized citizens” and “noncitizens” into one category, “non-native.” Would zipcodes with relatively high percentages of non-native residents be more or less restaurant-heavy than zipcodes with lower percentages of non-native residents? It turns out that zip codes with high percentages (in the case of this visualization, 25% or more) non-native residents do in fact have significantly more restaurants on average than other zipcodes.
Visualizations using Tableau
Boxplot using Tableau
Raw Scores- Histogram
As can be seen in the visualization below, the most common raw restaurant inspection score is between 90 and 100. The 90-95 and 95-100 bins are essentially the same. The second most common ranges are 85-90 and 100-105. These are also the same size, but a little over half the size of the most common score range.
The next largest bin is half the size of the second largest, and this pattern continues until we drop below 70. This is particularly good news, as it implies that failing scores (those below 70) are not very common at all, and furthermore that exceptional scores are highly common.
CapMetro Stops and Restaurant Density as they Relate to Population- Scatterplot
Here we see a comprehensive visualization that reflects data as we’ve seen it before, in our “interesting things” section. This scatterplot shows restaurant and stop density per zipcode. Each point is sized by the population of that zipcode. Ultimately, we see an overall positive trend. However, upon closer examination, it is clear that many high population zipcodes have a lower restaurant and stop density.
The implications of this have already been discussed in previous sections. However, it is also interesting to note that many zipcodes (regardless of population size), are grouped towards the lower to medium restaurant and stop density area. In fact, the further out in either direction you go, the less instances there are. This likely shows that while there is a slight positive correlation between stop and restaurant density, most areas are not likely to have an exceptional number of eihter.
Visualizations using Shiny in R
The screenshots and descriptions below are based on our interactive visualization app, accessible at https://kvaughn.shinyapps.io/restaurant_inspection/.
Boxplot Visualization using Shiny in R
One low score does not indicate a systemic problem with a restaurant’s cleanliness; before judging a restaurant with a low score, it behooves us to consider the other scores it has received. This boxplot allows the user to do just that. First select a cutoof point for scores, then choose a zipcode to see all restaurants that have scored below that level at least one. Each restaurant’s scores will display, giving the user an idea of whether the low score was a fluke or a sign of problems.
Histogram Visualization using Shiny in R
This visualization allows the user to get a broader view of the restaurant inspection scores. The user is given the option to select scores from any combination of years. What’s interesting here is that the vast majority of restaurant inspections result in “passing” grades of 70 or higher.
Crosstab Visualizations using Shiny in R
Click here to be taken to our published Shiny app: https://kvaughn.shinyapps.io/restaurant_inspection/
Safety Index
This visualization starts with a somewhat-arbitrary KPI, the “Safety Index,” that takes into account both the average score and the lowest inspection score received for a given zip code in a given year. A high Safety Index indicates both a high average score and a high minimum score. Set the parameters however you like (3 and 8 are reasonable defaults) and check the crosstab to see each zip code’s ratings over the four-year period.
Lowest Inspection Scores
This visualization allows the user to set an arbitrary cutoff point as an aid for identifying low inspection scores. The default value, 70, is the cutoff for restaurant inspections; lower scores will trigger a reinspection. After selecting a parameter using the slider and requesting data, the user can select the “Crosstab” tab to see which zip codes and years had unacceptably low inspection scores.
People per Restaurant
Austin is a fast-growing city, and our restaurant scene is currently booming as well. This visualization allows a user to see where the density of restaurants compared to population is highest. Using a public population dataset generated by the US Census, we can find the number of people per restaurant in a given zip code; as more restaurants are added (or fail), the people-per-restaurant number changes. The default values highlight the density of restaurants in 78701, which is downtown Austin.
Barchart Visualizations using Shiny in R
These visualizations are also accessible at https://kvaughn.shinyapps.io/restaurant_inspection/
Scores by Zip Code
Barcharts can be a good way to quickly identify trends visually. In this visualization, the user can create a set of zip codes of interest, then see each zipcode’s average score per year over the three or four years contained in the dataset.
Transit stops and restaurant density
Capital Metro, the Austin-area transit service, has made its data freely available for public download. By importing this data into geographic information software (ArcGIS) and overlaying the coordinates of transit stops on a shapefile of regional zip codes, we were able to count the number of transit stops per zip code. In this bar chart, we’re examining the relationship between transit density and restaurant density. Use the slider to select zip codes with more than the minimum number of transit stops, then compare the number of restaurants in these zip codes to the average number of restaurants per zip code throughout the region. The blue line is the calculated average for all of the selected zip codes, and the red is the grand average for all of the zipcodes in the Greater Austin area.
Nonnative residents and restaurant density
Would we expect zip codes with comparatively large numbers of nonnative (naturalized-citizen or noncitizen) residents to have more or fewer restaurants, on average, than zip codes with lower percentages of nonnative residents? This barchart allows the user to select a minimum percentage of nonnative residents per zip code, then shows the number of restaurants in that zip code and the average number of restaurants per zip code for all of the qualifying zip codes. This average can be compared to the grand average throughout the greater Austin area, shown in red.
<<<<<<< HEAD
---
title: "Austin Restarant Inspections"
author: "<center><b>Marcus Martinez and Kate Vaughn</b></center>"
date: "<center>May 2, 2017</center>"
output:
  html_notebook:
    toc: yes
    toc_depth: 4
  html_document:
    toc: yes
    toc_depth: 4
---

#**Introduction**
 In this project, we're examining inspection scores for restaurants in and around Austin, Texas, from 2014 through early 2017.  Scores are determined through routine inspections, usually conducted twice a year.  If a score lower than 70 is recorded in a yearly inspection, the restaurant will be subjected to one or more followup visits to ensure a return to compliance.  All of our restaurant inspection records include geographical information in the form of zip codes, allowing the use of additional datasets to further explore the data.  Visit https://data.world/kvaughn/s-17-dv-project-6 to explore our datasets, or see below for the visualizations we've made of the data.

##**R Configuration**
Below we display our sessionInfo().

```{r SessionInfo, echo=FALSE} 
sessionInfo(package=NULL)
```

#**The Data**
The restaurant inspection dataset was provided through the City of Austin's Open Data Portal, https://data.austintexas.gov/.  This free download is available in multiple formats at https://data.austintexas.gov/Health/Restaurant-Inspection-Scores/uthw-a7ih.  For this project, we downloaded a CSV file and cleaned it using the Extract-Translate-Load procedure outlined below.

#**Extract-Translate-Load (ETL)**
The dataset we received needed some cleaning, as it was unreadable to Tableau in its original CSV formatting.  We used the Extract-Translate-Load method discussed in class, and built an R script found at "../01 Data/ETLRestaurant.R"  We also cleaned up the join query used in our Tableau visualizations; it can be found at "../01 Data/ETLQuery.R"  If you'd like to follow along, here's how we processed the restaurant inspection data:

##*Step-By-Step Instructions*
**First, download the uncleaned CSV: https://data.austintexas.gov/api/views/ecmv-9xxi/rows.csv?accessType=DOWNLOAD. Prepend "PreETL_" to the file name, and put it in a folder called CSVs in the same directory as this project (s17dvproject5-vaughn-cannata-martinez).**

###*Step 1: Set up the environment*
Note that we have one column of numeric data ("Score"), one column of dates ("Inspection Date"), three columns of strings ("Restaurant Name", "Address", and "Process Description"), and two columns of numeric data used categorically ("Zip Code" and "Facility ID").  For the sake of ease of use, these last two columns will be treated as strings.
```{r etl1}
# Set working directory
setwd("~/s17dvfinalproject-vaughn-cannata-martinez-1/00 Docs")

# Set filepath
file_path = "../../CSVs/PreETL_Restaurant_Inspection_Scores.csv"

# Create dataframe from original CSV
df <- read.csv(file_path, stringsAsFactors = FALSE)

# Standardize column names
names(df) <- gsub("\\.+", " ", names(df))

# Print column names and types
str(df)
```

###*Step 2: Collect columns by type*
We've grouped the date column separately from measures and dimensions; while Tableau will recognize these data as dates, it seems prudent to reclassify them into a standardized format.
```{r etl2}
# Select string data as 'dimensions'
dimensions <- c("Restaurant Name", "Zip Code", "Address", "Process Description", "Facility ID")

# Select dates as 'dates'
dates <- c("Inspection Date")

# Select all remaining data as 'measures'
measures <- setdiff(names(df), union(dimensions, dates))

```

###*Step 3: Remove special characters*
Here we remove special characters from all of our data.  These characters can serve no purpose in the data, and might cause problems in the analysis process.  To improve readability, special characters have been replaced by spaces.
```{r etl3}
# Get rid of special characters in each column.
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= " "), stringsAsFactors = FALSE)
}
```

###*Step 4: Modify records in dimensions*
Now we go through all of the character-based columns (grouped as dimensions) and remove troublesome characters.  Ampersands are replaced with "and", semicolons are replaced with colons, and single- and double-quotes are removed outright.  Data of NA (null) is replaced with an empty string.
```{r etl4}
# This function will replace NA data with an empty string
na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}
# We'll apply this to all columns grouped as dimensions
if( length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString), stringsAsFactors = FALSE)
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""), stringsAsFactors = FALSE)
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "), stringsAsFactors = FALSE)
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"), stringsAsFactors = FALSE)
  }
}

```

###*Step 5: Modify dates*
Here we ensure that dates are formatted correctly.
```{r etl5}
if( length(dates) > 1 || ! is.na(dates)) {
  for(y in dates) {
    # Format as dates
    df[y] <- data.frame(lapply(df[y], function(y) as.Date(y, format = "%m/%d/%y")), stringsAsFactors = FALSE)
  }
}
```

###*Step 6: Modify measures*
Now we take all of the non-numeric characters out of our measures.
```{r etl6}
na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if(length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement=""), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], na2zero), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], function(m) as.numeric(as.character(m)))) # This is needed to turn measures back to numeric because gsub turns them into strings.
  }
}
```

###*Step 7: Dummy check*
Always a useful thing.  Let's look at our data and see what we've got.  We're expecting all data to be in character form except "Score", which will be numeric, and our date column named "Inspection Date".
```{r etl7}
# Take a look and make sure it's what you think it should be:
print(summary(df))
```

###*Step 8: Write to a new file*
Our data looks good, and we need to write it to a new file.
```{r etl8}
# Now write it to a new clean file.
write.csv(df, gsub("PreETL_", "", file_path), row.names=FALSE, na = "")
```

#**Visualizations**

##**Interesting Things**
###**Chinatown**  
Generally speaking, low scores are outliers for restaurants.  The inspector caught them on a bad day, perhaps.  Look at the histogram for an indication of the proportion of passing scores (above 70) versus failing scores.  
<center>![All Scores Histogram](../03 Visualizations/SHistB.png)</center>
<br>
If you're interested in seeing the whole collection of scores for a restaurant with at least one low score, use our Shiny boxplot and select the zipcode of interest.  Some restaurants demonstrate a trend in their low scores that goes beyond "one bad day".  Chinatown, in 78731, is a good example of this.  A score of 70 or higher is considered 'passing,' while a score below 70 triggers an automatic reinspection of the facility.  Restaurants are encouraged to aim for scores approaching 100.  Chinatown is apparently just aiming to pass.  With an average score of 71, it's meeting its goal.
<center>![Chinatown's Average Score](../03 Visualizations/SBoxB.png)</center>
<br>

###**The Limitations of our Dataset** 
We originally expected strong correlations between the number of restaurants in a given zip code and the population of that zip code, with an expectation that downtown would be an outlier (lots of restaurants, relatively few residents).  When we graphed our data, we found that restaurant quantity and population were only weakly correlated, and we found a number of unexpected outliers.  
<center>![Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
<br>
Similarly, we expected that the number of transit stops would be strongly correlated with the number of restaurants in a given zipcode.  This time, our data supported our expectations, with a moderately strong correlation of 0.756.  
<center>![Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>
<br>
Curiosity then led us to plot population versus number of transit stops for a given zipcode; again, the two were only weakly correlated, with many of the same outliers.  
<center>![Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
<br>
This prompted a closer look at those outliers, and demonstrated a shortcoming in our dataset.  It turns out that the outliers with large populations and few transit stops or restaurants are in the far suburbs of Austin.  These areas are governed by other civic bodies, and thus the bulk of their restaurant inspections (and, potentially, transit) will be conducted by an entity other than the City of Austin.  In order to get a cleaner view of the relationship between population, transit, and restaurant density, we would need to adjust our data to only include areas under the jurisdiction of the City of Austin. 
A good visualization of this distribution can be seen below.
<center>![Dataset Limitations Map](../03 Visualizations/TBOutlierMap.png)</center>
<br>
Compare this to the map of the boundaries of the City of Austin, courtesy of MapTechnica:
<center>![CoA Boundaries](../03 Visualizations/COABoundaries.png)</center>
<br>

###**Restaurant Density and Non-native Population**
The United States Census provides information on the number of residents per zipcode that are native citizens, as well as how many residents are naturalized citizens or noncitizens.  For the purposes of this visualization, we've grouped "naturalized citizens" and "noncitizens" into one category, "non-native."  Would zipcodes with relatively high percentages of non-native residents be more or less restaurant-heavy than zipcodes with lower percentages of non-native residents?  It turns out that zip codes with high percentages (in the case of this visualization, 25% or more) non-native residents do in fact have significantly more restaurants on average than other zipcodes.
<center>![Non-native Residents](../03 Visualizations/Nonnative.png)</center>
<br>


##**Visualizations using Tableau**

###**Boxplot using Tableau**

###**Raw Scores- Histogram**
As can be seen in the visualization below, the most common raw restaurant inspection score is between 90 and 100. The 90-95 and 95-100 bins are essentially the same. The second most common ranges are 85-90 and 100-105. These are also the same size, but a little over half the size of the most common score range. 

<center>![Tableau: Raw Score Histogram](../03 Visualizations/THist.png)</center>

The next largest bin is half the size of the second largest, and this pattern continues until we drop below 70. This is particularly good news, as it implies that failing scores (those below 70) are not very common at all, and furthermore that exceptional scores are highly common.

###**CapMetro Stops and Restaurant Density as they Relate to Population- Scatterplot**
Here we see a comprehensive visualization that reflects data as we've seen it before, in our "interesting things" section. This scatterplot shows restaurant and stop density per zipcode. Each point is sized by the population of that zipcode. Ultimately, we see an overall positive trend. However, upon closer examination, it is clear that many high population zipcodes have a lower restaurant and stop density. 

<center>![Tableau: Restaurant and Stop Density Scatterplot](../03 Visualizations/TBScatterplot.png)</center>

The implications of this have already been discussed in previous sections. However, it is also interesting to note that many zipcodes (regardless of population size), are grouped towards the lower to medium restaurant and stop density area. In fact, the further out in either direction you go, the less instances there are. This likely shows that while there is a slight positive correlation between stop and restaurant density, most areas are not likely to have an exceptional number of eihter. 

###**Crosstab-related Visualizations using Tableau**
Below you will find the visualizations we created using Tableau. Each visualization is only a fraction of the information that can be viewed in Tableau, as these crosstabs cannot showcase all the information available at once.  Our Tableau workbook is included in our repository in the folder 01 Data.

####*Inspection Scores, Area, and Income*
This crosstab shows by year and zipcode (in Austin), the average restaurant inspection score. The crosstab is colored by the percentage of households in each area under the national poverty line (pulled from US Census data). As is evident, average scores remain relatively high across poverty levels and zipcodes. In Tableau you are able to cycle through zipcodes in the western or eastern half of the greater Austin area. Even at this level, average scores remain similar. There are more average scores below 90 in the eastern section, but this is likely not significant.

<center>![Tableau: Scores by Income and Area](../03 Visualizations/CrosstabAverageScorebyPovertyLevel.png)</center>

It is possible this is unique to Austin, and may be a consequence of stringent food safety code in these areas. Further analysis would need to be done to see if these correlations are true elsewhere in the United States.

####*Average Inspection Scores Geographically*
As can be seen, this visualization displays the average inspections score for each zipcode area. Each zipcode's color is determined by its average inspection score's proximity to the grand average for the year(s) selected in the filter.  This is done using a calculated field, and will be recalculated according the the data selected in the Year filter.  If arriving at this map by clicking on either of the crosstabs, the zip codes will display the difference from the grand average of the selected year.

<center>![Tableau: Map of Average Inspection Scores](../03 Visualizations/CrosstabInspectionScoreMap.png)</center>

It is again evident that average scores are relatively high across each area. 

####*Average Score by Year*
This highly comprehensive visualization shows us three main pieces of information: average score, minimum score, and safety index. As has been previously noted, average scores are relatively high in each zipcode. Minumum scores can be quite low (one of the lowest scores is 36), but these are few and far between.

<center>![Tableau: Average Score by Year](../03 Visualizations/CrosstabScoresbyYear.png)</center>

This crosstab is also colored by a Safety Index.  This KPI is based in a calculated field that accounts for the average score and minimum score for a given zip code in a given year, then parameterized into Low, Medium, and High rankings.  The break point for each of these groupings can be adjusted using the sliders to the right of the crosstab.  

###**Barchart-related Visualizations using Tableau**

####*CapMetro Stops by Zipcode as they Relate to Restaurant Density*
Here we see visualized the number of bus stops per zipcode, colored by the number of restaurants in that zipcode. This visualization also shows an average line for the average number of stops. As is evident, the zipcodes with the darkest bars (those with the highest restuarant density), almost always surpass the average line, but also generally greatly surpass it. This is interesting because it suggests that, in general, zipcodes with a higher than average number of bus stops also have a high restuarant density.

<center>![Tableau: Stops by Zipcode](../03 Visualizations/TBar1A.png)</center>
<br>
<center>![Tableau: Stops by Zipcode, High Restaurant Density](../03 Visualizations/TBar1B.png)</center>

In Tableau, this visualization also allows you to adjust which zipcodes are displayed based on a minimum amount of restaurants, and also based on whether the zipcodes are above or below this minimum.

####*Foreign Population vs. Restaurant Density*
Here we find a visualization that showcases foreign population by zipcode, and is colored to reflect the restaurant density of each zipcode. In Tableau, you are also able to toggle back and forth between a higher than average, or an at/below average foreign population.

<center>![Tableau: Restaurant Density, Low Foreign Population](../03 Visualizations/TBar2A.png)</center>
<br>
<center>![Tableau: Restaurant Density, High Foreign Population](../03 Visualizations/TBar2B.png)</center>

From this barchart, it would seem that higher than average foreign populated zipcodes don't necessarily have a higher restaurant density. Overall, it appears that restaurant density is pretty even across levels of foreign-born population (i.e. there are relatively similar numbers of high restaurant density zipcodes in both categories). This is interesting because it suggests that restaurant density may be fairly independent of foreign- or native-born population status.

####*Stop to Restuarant Ratios per Zipcode by Nativity*
This visualization shows the populations of Native, Naturalized, and Non-Cititzen inhabitants for each zipcode. Further, these barcharts are colored by the restaurant to stop ratio (darker purple indicates a higher ratio). The highest ratios are between 78721 and 78724. 

<center>![Tableau: Stops to Restaurants by Nativity](../03 Visualizations/TBar3.png)</center>

Interestingly, the population of native, naturalized, and non-citizen peoples is quite variable across zipcodes. Another interesting point is that for all three categories, the highest ratios are not correlated with the highest populations. This, of course, seems non-intuitive, and may be an area for future inquiry.

##**Visualizations using Shiny in R**
The screenshots and descriptions below are based on our interactive visualization app, accessible at https://kvaughn.shinyapps.io/restaurant_inspection/.

###**Boxplot Visualization using Shiny in R**
One low score does not indicate a systemic problem with a restaurant's cleanliness; before judging a restaurant with a low score, it behooves us to consider the other scores it has received.  This boxplot allows the user to do just that.  First select a cutoof point for scores, then choose a zipcode to see all restaurants that have scored below that level at least one.  Each restaurant's scores will display, giving the user an idea of whether the low score was a fluke or a sign of problems.

<center>![Shiny: Scores by Restaurant Data](../03 Visualizations/SBoxA.png)</center> 
<br> 
<center>![Shiny: Scores by Restaurant Boxplot](../03 Visualizations/SBoxB.png)</center>

###**Histogram Visualization using Shiny in R**
This visualization allows the user to get a broader view of the restaurant inspection scores.  The user is given the option to select scores from any combination of years.  What's interesting here is that the vast majority of restaurant inspections result in "passing" grades of 70 or higher.

<center>![Shiny: All Scores Data](../03 Visualizations/SHistA.png)</center> 
<br>
<center>![Shiny: All Scores Histogram](../03 Visualizations/SHistB.png)</center>

###**Scatterplot Visualization using Shiny in R**
The Scatterplot tab includes three separate visualizations drawn from one set of data.  The first visualization plots the population of a zipcode against the number of restaurants in that zipcode.  The second graph plots the population per zipcode against the quantity of transit stops in that zipcode.  (Transit stop location information provided by Capital Metro and made available through data.texas.gov: https://data.texas.gov/dataset/GTFS/r4v4-vz24 )  As you see, the correlations between population and restaurant or transity-stop quantity are pretty weak.  The third visualization in this set graphs transit stops by restaurants, and shows a much stronger correlation.  See above for a discussion of one possible explanation for this pattern of relationships.

<center>![Shiny: Scatterplot Data](../03 Visualizations/SScatterData.png)</center> 
<br>
<center>![Shiny: Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
<br>
<center>![Shiny: Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
<br>
<center>![Shiny: Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>

###**Crosstab Visualizations using Shiny in R**
Click here to be taken to our published Shiny app: https://kvaughn.shinyapps.io/restaurant_inspection/

####*Safety Index*
This visualization starts with a somewhat-arbitrary KPI, the "Safety Index," that takes into account both the average score and the lowest inspection score received for a given zip code in a given year.  A high Safety Index indicates both a high average score and a high minimum score.  Set the parameters however you like (3 and 8 are reasonable defaults) and check the crosstab to see each zip code's ratings over the four-year period.

<center>![Shiny: Safety Index Data](../03 Visualizations/SCross1A.png)</center> 
<br>
<center>![Shiny: Safety Index Crosstab](../03 Visualizations/SCross1B.png)</center>

####*Lowest Inspection Scores*
This visualization allows the user to set an arbitrary cutoff point as an aid for identifying low inspection scores.  The default value, 70, is the cutoff for restaurant inspections; lower scores will trigger a reinspection.  After selecting a parameter using the slider and requesting data, the user can select the "Crosstab" tab to see which zip codes and years had unacceptably low inspection scores.

<center>![Shiny: Lowest Inspection Scores Data](../03 Visualizations/SCross2A.png)</center>
<br>
<center>![Shiny: Lowest Inspection Scores Crosstab](../03 Visualizations/SCross2B.png)</center>

####*People per Restaurant*
Austin is a fast-growing city, and our restaurant scene is currently booming as well.  This visualization allows a user to see where the density of restaurants compared to population is highest.  Using a public population dataset generated by the US Census, we can find the number of people per restaurant in a given zip code; as more restaurants are added (or fail), the people-per-restaurant number changes.  The default values highlight the density of restaurants in 78701, which is downtown Austin.

<center>![Shiny: People per Restaurant Data](../03 Visualizations/SCross3A.png)</center> 
<br>
<center>![Shiny: People per Restaurant Crosstab](../03 Visualizations/SCross3B.png)</center>

###**Barchart Visualizations using Shiny in R**
These visualizations are also accessible at https://kvaughn.shinyapps.io/restaurant_inspection/

####*Scores by Zip Code*
Barcharts can be a good way to quickly identify trends visually.  In this visualization, the user can create a set of zip codes of interest, then see each zipcode's average score per year over the three or four years contained in the dataset.

<center>![Shiny: Scores by Zip Code Data](../03 Visualizations/SBar1A.png)</center> 
<br>
<center>![Shiny: Scores by Zip Code Bar Chart](../03 Visualizations/SBar1B.png)</center>

####*Transit stops and restaurant density*
Capital Metro, the Austin-area transit service, has made its data freely available for public download.  By importing this data into geographic information software (ArcGIS) and overlaying the coordinates of transit stops on a shapefile of regional zip codes, we were able to count the number of transit stops per zip code.  In this bar chart, we're examining the relationship between transit density and restaurant density.  Use the slider to select zip codes with more than the minimum number of transit stops, then compare the number of restaurants in these zip codes to the average number of restaurants per zip code throughout the region.  The blue line is the calculated average for all of the selected zip codes, and the red is the grand average for all of the zipcodes in the Greater Austin area.

<center>![Shiny: Restaurant Density and Transit Stops Data](../03 Visualizations/SBar2A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Transit Stops Bar Chart](../03 Visualizations/SBar2B.png)</center>

####*Nonnative residents and restaurant density*
Would we expect zip codes with comparatively large numbers of nonnative (naturalized-citizen or noncitizen) residents to have more or fewer restaurants, on average, than zip codes with lower percentages of nonnative residents?  This barchart allows the user to select a minimum percentage of nonnative residents per zip code, then shows the number of restaurants in that zip code and the average number of restaurants per zip code for all of the qualifying zip codes.  This average can be compared to the grand average throughout the greater Austin area, shown in red.

<center>![Shiny: Restaurant Density and Nonnative Residents Data](../03 Visualizations/SBar3A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Nonnative Residents Bar Chart](../03 Visualizations/SBar3B.png)</center>

=======
<<<<<<< HEAD
---
title: "Austin Restarant Inspections"
author: "<center><b>Marcus Martinez and Kate Vaughn</b></center>"
date: "<center>May 2, 2017</center>"
output:
  html_notebook:
    toc: yes
    toc_depth: 4
  html_document:
    toc: yes
    toc_depth: 4
---

#**Introduction**
 In this project, we're examining inspection scores for restaurants in and around Austin, Texas, from 2014 through early 2017.  Scores are determined through routine inspections, usually conducted twice a year.  If a score lower than 70 is recorded in a yearly inspection, the restaurant will be subjected to one or more followup visits to ensure a return to compliance.  All of our restaurant inspection records include geographical information in the form of zip codes, allowing the use of additional datasets to further explore the data.  Visit https://data.world/kvaughn/s-17-dv-project-6 to explore our datasets, or see below for the visualizations we've made of the data.

##**R Configuration**
Below we display our sessionInfo().

```{r SessionInfo, echo=FALSE} 
sessionInfo(package=NULL)
```

#**The Data**
The restaurant inspection dataset was provided through the City of Austin's Open Data Portal, https://data.austintexas.gov/.  This free download is available in multiple formats at https://data.austintexas.gov/Health/Restaurant-Inspection-Scores/uthw-a7ih.  For this project, we downloaded a CSV file and cleaned it using the Extract-Translate-Load procedure outlined below.

#**Extract-Translate-Load (ETL)**
The dataset we received needed some cleaning, as it was unreadable to Tableau in its original CSV formatting.  We used the Extract-Translate-Load method discussed in class, and built an R script found at "../01 Data/ETLRestaurant.R"  We also cleaned up the join query used in our Tableau visualizations; it can be found at "../01 Data/ETLQuery.R"  If you'd like to follow along, here's how we processed the restaurant inspection data:

##*Step-By-Step Instructions*
**First, download the uncleaned CSV: https://data.austintexas.gov/api/views/ecmv-9xxi/rows.csv?accessType=DOWNLOAD. Prepend "PreETL_" to the file name, and put it in a folder called CSVs in the same directory as this project (s17dvproject5-vaughn-cannata-martinez).**

###*Step 1: Set up the environment*
Note that we have one column of numeric data ("Score"), one column of dates ("Inspection Date"), three columns of strings ("Restaurant Name", "Address", and "Process Description"), and two columns of numeric data used categorically ("Zip Code" and "Facility ID").  For the sake of ease of use, these last two columns will be treated as strings.
```{r etl1}
# Set working directory
setwd("~/s17dvfinalproject-vaughn-cannata-martinez-1/00 Docs")

# Set filepath
file_path = "../../CSVs/PreETL_Restaurant_Inspection_Scores.csv"

# Create dataframe from original CSV
df <- read.csv(file_path, stringsAsFactors = FALSE)

# Standardize column names
names(df) <- gsub("\\.+", " ", names(df))

# Print column names and types
str(df)
```

###*Step 2: Collect columns by type*
We've grouped the date column separately from measures and dimensions; while Tableau will recognize these data as dates, it seems prudent to reclassify them into a standardized format.
```{r etl2}
# Select string data as 'dimensions'
dimensions <- c("Restaurant Name", "Zip Code", "Address", "Process Description", "Facility ID")

# Select dates as 'dates'
dates <- c("Inspection Date")

# Select all remaining data as 'measures'
measures <- setdiff(names(df), union(dimensions, dates))

```

###*Step 3: Remove special characters*
Here we remove special characters from all of our data.  These characters can serve no purpose in the data, and might cause problems in the analysis process.  To improve readability, special characters have been replaced by spaces.
```{r etl3}
# Get rid of special characters in each column.
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= " "), stringsAsFactors = FALSE)
}
```

###*Step 4: Modify records in dimensions*
Now we go through all of the character-based columns (grouped as dimensions) and remove troublesome characters.  Ampersands are replaced with "and", semicolons are replaced with colons, and single- and double-quotes are removed outright.  Data of NA (null) is replaced with an empty string.
```{r etl4}
# This function will replace NA data with an empty string
na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}
# We'll apply this to all columns grouped as dimensions
if( length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString), stringsAsFactors = FALSE)
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""), stringsAsFactors = FALSE)
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "), stringsAsFactors = FALSE)
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"), stringsAsFactors = FALSE)
  }
}

```

###*Step 5: Modify dates*
Here we ensure that dates are formatted correctly.
```{r etl5}
if( length(dates) > 1 || ! is.na(dates)) {
  for(y in dates) {
    # Format as dates
    df[y] <- data.frame(lapply(df[y], function(y) as.Date(y, format = "%m/%d/%y")), stringsAsFactors = FALSE)
  }
}
```

###*Step 6: Modify measures*
Now we take all of the non-numeric characters out of our measures.
```{r etl6}
na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if(length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement=""), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], na2zero), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], function(m) as.numeric(as.character(m)))) # This is needed to turn measures back to numeric because gsub turns them into strings.
  }
}
```

###*Step 7: Dummy check*
Always a useful thing.  Let's look at our data and see what we've got.  We're expecting all data to be in character form except "Score", which will be numeric, and our date column named "Inspection Date".
```{r etl7}
# Take a look and make sure it's what you think it should be:
print(summary(df))
```

###*Step 8: Write to a new file*
Our data looks good, and we need to write it to a new file.
```{r etl8}
# Now write it to a new clean file.
write.csv(df, gsub("PreETL_", "", file_path), row.names=FALSE, na = "")
```

#**Visualizations**

##**Interesting Things**
###**Chinatown.**  
Generally speaking, low scores are outliers for restaurants.  The inspector caught them on a bad day, perhaps.  Look at the histogram for an indication of the proportion of passing scores (above 70) versus failing scores.  
<center>![All Scores Histogram](../03 Visualizations/SHistB.png)</center>
If you're interested in seeing the whole collection of scores for a restaurant with at least one low score, use the boxplot and select the zipcode of interest.  Some restaurants demonstrate a trend in their low scores that goes beyond "one bad day".  Chinatown, in 78731, is a good example of this.  A score of 70 or higher is considered 'passing,' while a score below 70 triggers an automatic reinspection of the facility.  Restaurants are encouraged to aim for scores approaching 100.  Chinatown is apparently just aiming to pass.  With an average score of 71, it's meeting its goal.
<center>![Chinatown's Average Score](../03 Visualizations/SBoxB.png)</center>
<br>

###**The limitations of our dataset.** 
We originally expected strong correlations between the number of restaurants in a given zip code and the population of that zip code, with an expectation that downtown would be an outlier (lots of restaurants, relatively few residents).  When we graphed our data, we found that restaurant quantity and population were only weakly correlated, and we found a number of unexpected outliers.  
<center>![Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
Similarly, we expected that the number of transit stops would be strongly correlated with the number of restaurants in a given zipcode.  This time, our data supported our expectations, with a moderately strong correlation of 0.756.  
<center>![Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>
Curiosity then led us to plot population versus number of transit stops for a given zipcode; again, the two were only weakly correlated, with many of the same outliers.  
<center>![Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
This prompted a closer look at those outliers, and demonstrated a shortcoming in our dataset.  It turns out that the outliers with large populations and few transit stops or restaurants are in the far suburbs of Austin.  These areas are governed by other civic bodies, and thus the bulk of their restaurant inspections (and, potentially, transit) will be conducted by an entity other than the City of Austin.  In order to get a cleaner view of the relationship between population, transit, and restaurant density, we would need to adjust our data to only include areas under the jurisdiction of the City of Austin. 
A good visualization of this distribution can be seen below.
<center>![Dataset Limitations Map](../03 Visualizations/TBOutlierMap.png)</center>
<br>

###**Restaurant Density and Non-native Population**


##**Visualizations using Tableau**

###**Boxplot using Tableau**

<center>![Tableau: Scores Below Cutoff Boxplot](../03 Visualizations/TBoxplot.png)</center>

###**Raw Scores- Histogram**
As can be seen in the visualization below, the most common raw restaurant inspection score is between 90 and 100. The 90-95 and 95-100 bins are essentially the same. The second most common ranges are 85-90 and 100-105. These are also the same size, but a little over half the size of the most common score range. 

<center>![Tableau: Raw Score Histogram](../03 Visualizations/THist.png)</center>

The next largest bin is half the size of the second largest, and this pattern continues until we drop below 70. This is particularly good news, as it implies that failing scores (those below 70) are not very common at all, and furthermore that exceptional scores are highly common.

###**CapMetro Stops and Restaurant Density as they Relate to Population- Scatterplot**
Here we see a comprehensive visualization that reflects data as we've seen it before, in our "interesting things" section. This scatterplot shows restaurant and stop density per zipcode. Each point is sized by the population of that zipcode. Ultimately, we see an overall positive trend. However, upon closer examination, it is clear that many high population zipcodes have a lower restaurant and stop density. 

<center>![Tableau: Restaurant and Stop Density Scatterplot](../03 Visualizations/TBScatterplot.png)</center>

The implications of this have already been discussed in previous sections. However, it is also interesting to note that many zipcodes (regardless of population size), are grouped towards the lower to medium restaurant and stop density area. In fact, the further out in either direction you go, the less instances there are. This likely shows that while there is a slight positive correlation between stop and restaurant density, most areas are not likely to have an exceptional number of eihter. 

###**Crosstab-related Visualizations using Tableau**
Below you will find the visualizations we created using Tableau. Each visualization is only a fraction of the information that can be viewed in Tableau, as these crosstabs cannot showcase all the information available at once.  Our Tableau workbook is included in our repository in the folder 01 Data.

####*Inspection Scores, Area, and Income*
This crosstab shows by year and zipcode (in Austin), the average restaurant inspection score. The crosstab is colored by the percentage of households in each area under the national poverty line (pulled from US Census data). As is evident, average scores remain relatively high across poverty levels and zipcodes. In Tableau you are able to cycle through zipcodes in the western or eastern half of the greater Austin area. Even at this level, average scores remain similar. There are more average scores below 90 in the eastern section, but this is likely not significant.

<center>![Tableau: Scores by Income and Area](../03 Visualizations/CrosstabAverageScorebyPovertyLevel.png)</center>

It is possible this is unique to Austin, and may be a consequence of stringent food safety code in these areas. Further analysis would need to be done to see if these correlations are true elsewhere in the United States.

####*Average Inspection Scores Geographically*
As can be seen, this visualization displays the average inspections score for each zipcode area. Each zipcode's color is determined by its average inspection score's proximity to the grand average for the year(s) selected in the filter.  This is done using a calculated field, and will be recalculated according the the data selected in the Year filter.  If arriving at this map by clicking on either of the crosstabs, the zip codes will display the difference from the grand average of the selected year.

<center>![Tableau: Map of Average Inspection Scores](../03 Visualizations/CrosstabInspectionScoreMap.png)</center>

It is again evident that average scores are relatively high across each area. 

####*Average Score by Year*
This highly comprehensive visualization shows us three main pieces of information: average score, minimum score, and safety index. As has been previously noted, average scores are relatively high in each zipcode. Minumum scores can be quite low (one of the lowest scores is 36), but these are few and far between.

<center>![Tableau: Average Score by Year](../03 Visualizations/CrosstabScoresbyYear.png)</center>

This crosstab is also colored by a Safety Index.  This KPI is based in a calculated field that accounts for the average score and minimum score for a given zip code in a given year, then parameterized into Low, Medium, and High rankings.  The break point for each of these groupings can be adjusted using the sliders to the right of the crosstab.  

###**Barchart-related Visualizations using Tableau**


####*CapMetro Stops by Zipcode as they Relate to Restaurant Density*
Here we see visualized the number of bus stops per zipcode, colored by the number of restaurants in that zipcode. This visualization also shows an average line for the average number of stops. As is evident, the zipcodes with the darkest bars (those with the highest restuarant density), almost always surpass the average line, but also generally greatly surpass it. This is interesting because it suggests that, in general, zipcodes with a higher than average number of bus stops also have a high restuarant density.

<center>![Tableau: Stops by Zipcode](../03 Visualizations/TBar1A.png)</center>
<br>
<center>![Tableau: Stops by Zipcode, High Restaurant Density](../03 Visualizations/TBar1B.png)</center>

In Tableau, this visualization also allows you to adjust which zipcodes are displayed based on a minimum amount of restaurants, and also based on whether the zipcodes are above or below this minimum.

####*Foreign Population vs. Restaurant Density*
Here we find a visualization that showcases foreign population by zipcode, and is colored to reflect the restaurant density of each zipcode. In Tableau, you are also able to toggle back and forth between a higher than average, or an at/below average foreign population.

<center>![Tableau: Restaurant Density, Low Foreign Population](../03 Visualizations/TBar2A.png)</center>
<br>
<center>![Tableau: Restaurant Density, High Foreign Population](../03 Visualizations/TBar2B.png)</center>

From this barchart, it would seem that higher than average foreign populated zipcodes don't necessarily have a higher restaurant density. Overall, it appears that restaurant density is pretty even across levels of foreign-born population (i.e. there are relatively similar numbers of high restaurant density zipcodes in both categories). This is interesting because it suggests that restaurant density may be fairly independent of foreign- or native-born population status.

####*Stop to Restuarant Ratios per Zipcode by Nativity*
This visualization shows the populations of Native, Naturalized, and Non-Cititzen inhabitants for each zipcode. Further, these barcharts are colored by the restaurant to stop ratio (darker purple indicates a higher ratio). The highest ratios are between 78721 and 78724. 

<center>![Tableau: Stops to Restaurants by Nativity](../03 Visualizations/TBar3.png)</center>

Interestingly, the population of native, naturalized, and non-citizen peoples is quite variable across zipcodes. Another interesting point is that for all three categories, the highest ratios are not correlated with the highest populations. This, of course, seems non-intuitive, and may be an area for future inquiry.

##**Visualizations using Shiny in R**
The screenshots and descriptions below are based on our interactive visualization app, accessible at https://kvaughn.shinyapps.io/restaurant_inspection/.

###**Boxplot Visualization using Shiny in R**
One low score does not indicate a systemic problem with a restaurant's cleanliness; before judging a restaurant with a low score, it behooves us to consider the other scores it has received.  This boxplot allows the user to do just that.  First select a cutoof point for scores, then choose a zipcode to see all restaurants that have scored below that level at least one.  Each restaurant's scores will display, giving the user an idea of whether the low score was a fluke or a sign of problems.

<center>![Shiny: Scores by Restaurant Data](../03 Visualizations/SBoxA.png)</center> 
<br> 
<center>![Shiny: Scores by Restaurant Boxplot](../03 Visualizations/SBoxB.png)</center>

###**Histogram Visualization using Shiny in R**
This visualization allows the user to get a broader view of the restaurant inspection scores.  The user is given the option to select scores from any combination of years.  What's interesting here is that the vast majority of restaurant inspections result in "passing" grades of 70 or higher.

<center>![Shiny: All Scores Data](../03 Visualizations/SHistA.png)</center> 
<br>
<center>![Shiny: All Scores Histogram](../03 Visualizations/SHistB.png)</center>

###**Scatterplot Visualization using Shiny in R**
The Scatterplot tab includes three separate visualizations drawn from one set of data.  The first visualization plots the population of a zipcode against the number of restaurants in that zipcode.  The second graph plots the population per zipcode against the quantity of transit stops in that zipcode.  (Transit stop location information provided by Capital Metro and made available through data.texas.gov: https://data.texas.gov/dataset/GTFS/r4v4-vz24 )  As you see, the correlations between population and restaurant or transity-stop quantity are pretty weak.  The third visualization in this set graphs transit stops by restaurants, and shows a much stronger correlation.  See above for a discussion of one possible explanation for this pattern of relationships.

<center>![Shiny: Scatterplot Data](../03 Visualizations/SScatterData.png)</center> 
<br>
<center>![Shiny: Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
<br>
<center>![Shiny: Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
<br>
<center>![Shiny: Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>

###**Crosstab Visualizations using Shiny in R**
Click here to be taken to our published Shiny app: https://kvaughn.shinyapps.io/restaurant_inspection/

####*Safety Index*
This visualization starts with a somewhat-arbitrary KPI, the "Safety Index," that takes into account both the average score and the lowest inspection score received for a given zip code in a given year.  A high Safety Index indicates both a high average score and a high minimum score.  Set the parameters however you like (3 and 8 are reasonable defaults) and check the crosstab to see each zip code's ratings over the four-year period.

<center>![Shiny: Safety Index Data](../03 Visualizations/SCross1A.png)</center> 
<br>
<center>![Shiny: Safety Index Crosstab](../03 Visualizations/SCross1B.png)</center>

####*Lowest Inspection Scores*
This visualization allows the user to set an arbitrary cutoff point as an aid for identifying low inspection scores.  The default value, 70, is the cutoff for restaurant inspections; lower scores will trigger a reinspection.  After selecting a parameter using the slider and requesting data, the user can select the "Crosstab" tab to see which zip codes and years had unacceptably low inspection scores.

<center>![Shiny: Lowest Inspection Scores Data](../03 Visualizations/SCross2A.png)</center>
<br>
<center>![Shiny: Lowest Inspection Scores Crosstab](../03 Visualizations/SCross2B.png)</center>

####*People per Restaurant*
Austin is a fast-growing city, and our restaurant scene is currently booming as well.  This visualization allows a user to see where the density of restaurants compared to population is highest.  Using a public population dataset generated by the US Census, we can find the number of people per restaurant in a given zip code; as more restaurants are added (or fail), the people-per-restaurant number changes.  The default values highlight the density of restaurants in 78701, which is downtown Austin.

<center>![Shiny: People per Restaurant Data](../03 Visualizations/SCross3A.png)</center> 
<br>
<center>![Shiny: People per Restaurant Crosstab](../03 Visualizations/SCross3B.png)</center>

###**Barchart Visualizations using Shiny in R**
These visualizations are also accessible at https://kvaughn.shinyapps.io/restaurant_inspection/

####*Scores by Zip Code*
Barcharts can be a good way to quickly identify trends visually.  In this visualization, the user can create a set of zip codes of interest, then see each zipcode's average score per year over the three or four years contained in the dataset.

<center>![Shiny: Scores by Zip Code Data](../03 Visualizations/SBar1A.png)</center> 
<br>
<center>![Shiny: Scores by Zip Code Bar Chart](../03 Visualizations/SBar1B.png)</center>

####*Transit stops and restaurant density*
Capital Metro, the Austin-area transit service, has made its data freely available for public download.  By importing this data into geographic information software (ArcGIS) and overlaying the coordinates of transit stops on a shapefile of regional zip codes, we were able to count the number of transit stops per zip code.  In this bar chart, we're examining the relationship between transit density and restaurant density.  Use the slider to select zip codes with more than the minimum number of transit stops, then compare the number of restaurants in these zip codes to the average number of restaurants per zip code throughout the region.  The blue line is the calculated average for all of the selected zip codes, and the red is the grand average for all of the zipcodes in the Greater Austin area.

<center>![Shiny: Restaurant Density and Transit Stops Data](../03 Visualizations/SBar2A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Transit Stops Bar Chart](../03 Visualizations/SBar2B.png)</center>

####*Nonnative residents and restaurant density*
Would we expect zip codes with comparatively large numbers of nonnative (naturalized-citizen or noncitizen) residents to have more or fewer restaurants, on average, than zip codes with lower percentages of nonnative residents?  This barchart allows the user to select a minimum percentage of nonnative residents per zip code, then shows the number of restaurants in that zip code and the average number of restaurants per zip code for all of the qualifying zip codes.  This average can be compared to the grand average throughout the greater Austin area, shown in red.

<center>![Shiny: Restaurant Density and Nonnative Residents Data](../03 Visualizations/SBar3A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Nonnative Residents Bar Chart](../03 Visualizations/SBar3B.png)</center>

=======
---
title: "Austin Restarant Inspections"
author: "<center><b>Marcus Martinez and Kate Vaughn</b></center>"
date: "<center>May 2, 2017</center>"
output:
  html_notebook:
    toc: yes
    toc_depth: 4
  html_document:
    toc: yes
    toc_depth: 4
---

#**Introduction**
 In this project, we're examining inspection scores for restaurants in and around Austin, Texas, from 2014 through early 2017.  Scores are determined through routine inspections, usually conducted twice a year.  If a score lower than 70 is recorded in a yearly inspection, the restaurant will be subjected to one or more followup visits to ensure a return to compliance.  All of our restaurant inspection records include geographical information in the form of zip codes, allowing the use of additional datasets to further explore the data.  Visit https://data.world/kvaughn/s-17-dv-project-6 to explore our datasets, or see below for the visualizations we've made of the data.

##**R Configuration**
Below we display our sessionInfo().

```{r SessionInfo, echo=FALSE} 
sessionInfo(package=NULL)
```

#**The Data**
The restaurant inspection dataset was provided through the City of Austin's Open Data Portal, https://data.austintexas.gov/.  This free download is available in multiple formats at https://data.austintexas.gov/Health/Restaurant-Inspection-Scores/uthw-a7ih.  For this project, we downloaded a CSV file and cleaned it using the Extract-Translate-Load procedure outlined below.

#**Extract-Translate-Load (ETL)**
The dataset we received needed some cleaning, as it was unreadable to Tableau in its original CSV formatting.  We used the Extract-Translate-Load method discussed in class, and built an R script found at "../01 Data/ETLRestaurant.R"  We also cleaned up the join query used in our Tableau visualizations; it can be found at "../01 Data/ETLQuery.R"  If you'd like to follow along, here's how we processed the restaurant inspection data:

##*Step-By-Step Instructions*
**First, download the uncleaned CSV: https://data.austintexas.gov/api/views/ecmv-9xxi/rows.csv?accessType=DOWNLOAD. Prepend "PreETL_" to the file name, and put it in a folder called CSVs in the same directory as this project (s17dvproject5-vaughn-cannata-martinez).**

###*Step 1: Set up the environment*
Note that we have one column of numeric data ("Score"), one column of dates ("Inspection Date"), three columns of strings ("Restaurant Name", "Address", and "Process Description"), and two columns of numeric data used categorically ("Zip Code" and "Facility ID").  For the sake of ease of use, these last two columns will be treated as strings.
```{r etl1}
# Set working directory
setwd("~/s17dvfinalproject-vaughn-cannata-martinez-1/00 Docs")

# Set filepath
file_path = "../../CSVs/PreETL_Restaurant_Inspection_Scores.csv"

# Create dataframe from original CSV
df <- read.csv(file_path, stringsAsFactors = FALSE)

# Standardize column names
names(df) <- gsub("\\.+", " ", names(df))

# Print column names and types
str(df)
```

###*Step 2: Collect columns by type*
We've grouped the date column separately from measures and dimensions; while Tableau will recognize these data as dates, it seems prudent to reclassify them into a standardized format.
```{r etl2}
# Select string data as 'dimensions'
dimensions <- c("Restaurant Name", "Zip Code", "Address", "Process Description", "Facility ID")

# Select dates as 'dates'
dates <- c("Inspection Date")

# Select all remaining data as 'measures'
measures <- setdiff(names(df), union(dimensions, dates))

```

###*Step 3: Remove special characters*
Here we remove special characters from all of our data.  These characters can serve no purpose in the data, and might cause problems in the analysis process.  To improve readability, special characters have been replaced by spaces.
```{r etl3}
# Get rid of special characters in each column.
for(n in names(df)) {
  df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= " "), stringsAsFactors = FALSE)
}
```

###*Step 4: Modify records in dimensions*
Now we go through all of the character-based columns (grouped as dimensions) and remove troublesome characters.  Ampersands are replaced with "and", semicolons are replaced with colons, and single- and double-quotes are removed outright.  Data of NA (null) is replaced with an empty string.
```{r etl4}
# This function will replace NA data with an empty string
na2emptyString <- function (x) {
  x[is.na(x)] <- ""
  return(x)
}
# We'll apply this to all columns grouped as dimensions
if( length(dimensions) > 0) {
  for(d in dimensions) {
    # Change NA to the empty string.
    df[d] <- data.frame(lapply(df[d], na2emptyString), stringsAsFactors = FALSE)
    # Get rid of " and ' in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""), stringsAsFactors = FALSE)
    # Change & to and in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "), stringsAsFactors = FALSE)
    # Change : to ; in dimensions.
    df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"), stringsAsFactors = FALSE)
  }
}

```

###*Step 5: Modify dates*
Here we ensure that dates are formatted correctly.
```{r etl5}
if( length(dates) > 1 || ! is.na(dates)) {
  for(y in dates) {
    # Format as dates
    df[y] <- data.frame(lapply(df[y], function(y) as.Date(y, format = "%m/%d/%y")), stringsAsFactors = FALSE)
  }
}
```

###*Step 6: Modify measures*
Now we take all of the non-numeric characters out of our measures.
```{r etl6}
na2zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if(length(measures) > 1 || ! is.na(measures)) {
  for(m in measures) {
    df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement=""), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], na2zero), stringsAsFactors = FALSE)
    df[m] <- data.frame(lapply(df[m], function(m) as.numeric(as.character(m)))) # This is needed to turn measures back to numeric because gsub turns them into strings.
  }
}
```

###*Step 7: Dummy check*
Always a useful thing.  Let's look at our data and see what we've got.  We're expecting all data to be in character form except "Score", which will be numeric, and our date column named "Inspection Date".
```{r etl7}
# Take a look and make sure it's what you think it should be:
print(summary(df))
```

###*Step 8: Write to a new file*
Our data looks good, and we need to write it to a new file.
```{r etl8}
# Now write it to a new clean file.
write.csv(df, gsub("PreETL_", "", file_path), row.names=FALSE, na = "")
```

#**Visualizations**

##**Interesting Things**
###**Chinatown**  
Generally speaking, low scores are outliers for restaurants.  The inspector caught them on a bad day, perhaps.  Look at the histogram for an indication of the proportion of passing scores (above 70) versus failing scores.  
<center>![All Scores Histogram](../03 Visualizations/SHistB.png)</center>
<br>
If you're interested in seeing the whole collection of scores for a restaurant with at least one low score, use our Shiny boxplot and select the zipcode of interest.  Some restaurants demonstrate a trend in their low scores that goes beyond "one bad day".  Chinatown, in 78731, is a good example of this.  A score of 70 or higher is considered 'passing,' while a score below 70 triggers an automatic reinspection of the facility.  Restaurants are encouraged to aim for scores approaching 100.  Chinatown is apparently just aiming to pass.  With an average score of 71, it's meeting its goal.
<center>![Chinatown's Average Score](../03 Visualizations/SBoxB.png)</center>
<br>

###**The Limitations of our Dataset** 
We originally expected strong correlations between the number of restaurants in a given zip code and the population of that zip code, with an expectation that downtown would be an outlier (lots of restaurants, relatively few residents).  When we graphed our data, we found that restaurant quantity and population were only weakly correlated, and we found a number of unexpected outliers.  
<center>![Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
<br>
Similarly, we expected that the number of transit stops would be strongly correlated with the number of restaurants in a given zipcode.  This time, our data supported our expectations, with a moderately strong correlation of 0.756.  
<center>![Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>
<br>
Curiosity then led us to plot population versus number of transit stops for a given zipcode; again, the two were only weakly correlated, with many of the same outliers.  
<center>![Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
<br>
This prompted a closer look at those outliers, and demonstrated a shortcoming in our dataset.  It turns out that the outliers with large populations and few transit stops or restaurants are in the far suburbs of Austin.  These areas are governed by other civic bodies, and thus the bulk of their restaurant inspections (and, potentially, transit) will be conducted by an entity other than the City of Austin.  In order to get a cleaner view of the relationship between population, transit, and restaurant density, we would need to adjust our data to only include areas under the jurisdiction of the City of Austin. 
A good visualization of this distribution can be seen below.
<center>![Dataset Limitations Map](../03 Visualizations/TBOutlierMap.png)</center>
<br>

###**Restaurant Density and Non-native Population**
The United States Census provides information on the number of residents per zipcode that are native citizens, as well as how many residents are naturalized citizens or noncitizens.  For the purposes of this visualization, we've grouped "naturalized citizens" and "noncitizens" into one category, "non-native."  Would zipcodes with relatively high percentages of non-native residents be more or less restaurant-heavy than zipcodes with lower percentages of non-native residents?  It turns out that zip codes with high percentages (in the case of this visualization, 25% or more) non-native residents do in fact have significantly more restaurants on average than other zipcodes.
<center>![Non-native Residents](../03 Visualizations/Nonnative.png)</center>
<br>


##**Visualizations using Tableau**

###**Boxplot using Tableau**

###**Raw Scores- Histogram**
As can be seen in the visualization below, the most common raw restaurant inspection score is between 90 and 100. The 90-95 and 95-100 bins are essentially the same. The second most common ranges are 85-90 and 100-105. These are also the same size, but a little over half the size of the most common score range. 

<center>![Tableau: Raw Score Histogram](../03 Visualizations/THist.png)</center>

The next largest bin is half the size of the second largest, and this pattern continues until we drop below 70. This is particularly good news, as it implies that failing scores (those below 70) are not very common at all, and furthermore that exceptional scores are highly common.

###**CapMetro Stops and Restaurant Density as they Relate to Population- Scatterplot**
Here we see a comprehensive visualization that reflects data as we've seen it before, in our "interesting things" section. This scatterplot shows restaurant and stop density per zipcode. Each point is sized by the population of that zipcode. Ultimately, we see an overall positive trend. However, upon closer examination, it is clear that many high population zipcodes have a lower restaurant and stop density. 

<center>![Tableau: Restaurant and Stop Density Scatterplot](../03 Visualizations/TBScatterplot.png)</center>

The implications of this have already been discussed in previous sections. However, it is also interesting to note that many zipcodes (regardless of population size), are grouped towards the lower to medium restaurant and stop density area. In fact, the further out in either direction you go, the less instances there are. This likely shows that while there is a slight positive correlation between stop and restaurant density, most areas are not likely to have an exceptional number of eihter. 

###**Crosstab-related Visualizations using Tableau**
Below you will find the visualizations we created using Tableau. Each visualization is only a fraction of the information that can be viewed in Tableau, as these crosstabs cannot showcase all the information available at once.  Our Tableau workbook is included in our repository in the folder 01 Data.

####*Inspection Scores, Area, and Income*
This crosstab shows by year and zipcode (in Austin), the average restaurant inspection score. The crosstab is colored by the percentage of households in each area under the national poverty line (pulled from US Census data). As is evident, average scores remain relatively high across poverty levels and zipcodes. In Tableau you are able to cycle through zipcodes in the western or eastern half of the greater Austin area. Even at this level, average scores remain similar. There are more average scores below 90 in the eastern section, but this is likely not significant.

<center>![Tableau: Scores by Income and Area](../03 Visualizations/CrosstabAverageScorebyPovertyLevel.png)</center>

It is possible this is unique to Austin, and may be a consequence of stringent food safety code in these areas. Further analysis would need to be done to see if these correlations are true elsewhere in the United States.

####*Average Inspection Scores Geographically*
As can be seen, this visualization displays the average inspections score for each zipcode area. Each zipcode's color is determined by its average inspection score's proximity to the grand average for the year(s) selected in the filter.  This is done using a calculated field, and will be recalculated according the the data selected in the Year filter.  If arriving at this map by clicking on either of the crosstabs, the zip codes will display the difference from the grand average of the selected year.

<center>![Tableau: Map of Average Inspection Scores](../03 Visualizations/CrosstabInspectionScoreMap.png)</center>

It is again evident that average scores are relatively high across each area. 

####*Average Score by Year*
This highly comprehensive visualization shows us three main pieces of information: average score, minimum score, and safety index. As has been previously noted, average scores are relatively high in each zipcode. Minumum scores can be quite low (one of the lowest scores is 36), but these are few and far between.

<center>![Tableau: Average Score by Year](../03 Visualizations/CrosstabScoresbyYear.png)</center>

This crosstab is also colored by a Safety Index.  This KPI is based in a calculated field that accounts for the average score and minimum score for a given zip code in a given year, then parameterized into Low, Medium, and High rankings.  The break point for each of these groupings can be adjusted using the sliders to the right of the crosstab.  

###**Barchart-related Visualizations using Tableau**


####*CapMetro Stops by Zipcode as they Relate to Restaurant Density*
Here we see visualized the number of bus stops per zipcode, colored by the number of restaurants in that zipcode. This visualization also shows an average line for the average number of stops. As is evident, the zipcodes with the darkest bars (those with the highest restuarant density), almost always surpass the average line, but also generally greatly surpass it. This is interesting because it suggests that, in general, zipcodes with a higher than average number of bus stops also have a high restuarant density.

<center>![Tableau: Stops by Zipcode](../03 Visualizations/TBar1A.png)</center>
<br>
<center>![Tableau: Stops by Zipcode, High Restaurant Density](../03 Visualizations/TBar1B.png)</center>

In Tableau, this visualization also allows you to adjust which zipcodes are displayed based on a minimum amount of restaurants, and also based on whether the zipcodes are above or below this minimum.

####*Foreign Population vs. Restaurant Density*
Here we find a visualization that showcases foreign population by zipcode, and is colored to reflect the restaurant density of each zipcode. In Tableau, you are also able to toggle back and forth between a higher than average, or an at/below average foreign population.

<center>![Tableau: Restaurant Density, Low Foreign Population](../03 Visualizations/TBar2A.png)</center>
<br>
<center>![Tableau: Restaurant Density, High Foreign Population](../03 Visualizations/TBar2B.png)</center>

From this barchart, it would seem that higher than average foreign populated zipcodes don't necessarily have a higher restaurant density. Overall, it appears that restaurant density is pretty even across levels of foreign-born population (i.e. there are relatively similar numbers of high restaurant density zipcodes in both categories). This is interesting because it suggests that restaurant density may be fairly independent of foreign- or native-born population status.

####*Stop to Restuarant Ratios per Zipcode by Nativity*
This visualization shows the populations of Native, Naturalized, and Non-Cititzen inhabitants for each zipcode. Further, these barcharts are colored by the restaurant to stop ratio (darker purple indicates a higher ratio). The highest ratios are between 78721 and 78724. 

<center>![Tableau: Stops to Restaurants by Nativity](../03 Visualizations/TBar3.png)</center>

Interestingly, the population of native, naturalized, and non-citizen peoples is quite variable across zipcodes. Another interesting point is that for all three categories, the highest ratios are not correlated with the highest populations. This, of course, seems non-intuitive, and may be an area for future inquiry.

##**Visualizations using Shiny in R**
The screenshots and descriptions below are based on our interactive visualization app, accessible at https://kvaughn.shinyapps.io/restaurant_inspection/.

###**Boxplot Visualization using Shiny in R**
One low score does not indicate a systemic problem with a restaurant's cleanliness; before judging a restaurant with a low score, it behooves us to consider the other scores it has received.  This boxplot allows the user to do just that.  First select a cutoof point for scores, then choose a zipcode to see all restaurants that have scored below that level at least one.  Each restaurant's scores will display, giving the user an idea of whether the low score was a fluke or a sign of problems.

<center>![Shiny: Scores by Restaurant Data](../03 Visualizations/SBoxA.png)</center> 
<br> 
<center>![Shiny: Scores by Restaurant Boxplot](../03 Visualizations/SBoxB.png)</center>

###**Histogram Visualization using Shiny in R**
This visualization allows the user to get a broader view of the restaurant inspection scores.  The user is given the option to select scores from any combination of years.  What's interesting here is that the vast majority of restaurant inspections result in "passing" grades of 70 or higher.

<center>![Shiny: All Scores Data](../03 Visualizations/SHistA.png)</center> 
<br>
<center>![Shiny: All Scores Histogram](../03 Visualizations/SHistB.png)</center>

###**Scatterplot Visualization using Shiny in R**
The Scatterplot tab includes three separate visualizations drawn from one set of data.  The first visualization plots the population of a zipcode against the number of restaurants in that zipcode.  The second graph plots the population per zipcode against the quantity of transit stops in that zipcode.  (Transit stop location information provided by Capital Metro and made available through data.texas.gov: https://data.texas.gov/dataset/GTFS/r4v4-vz24 )  As you see, the correlations between population and restaurant or transity-stop quantity are pretty weak.  The third visualization in this set graphs transit stops by restaurants, and shows a much stronger correlation.  See above for a discussion of one possible explanation for this pattern of relationships.

<center>![Shiny: Scatterplot Data](../03 Visualizations/SScatterData.png)</center> 
<br>
<center>![Shiny: Restaurant by Population Scatterplot](../03 Visualizations/SScatter1.png)</center>
<br>
<center>![Shiny: Transit by Population Scatterplot](../03 Visualizations/SScatter2.png)</center> 
<br>
<center>![Shiny: Restaurant by Transit Scatterplot](../03 Visualizations/SScatter3.png)</center>

###**Crosstab Visualizations using Shiny in R**
Click here to be taken to our published Shiny app: https://kvaughn.shinyapps.io/restaurant_inspection/

####*Safety Index*
This visualization starts with a somewhat-arbitrary KPI, the "Safety Index," that takes into account both the average score and the lowest inspection score received for a given zip code in a given year.  A high Safety Index indicates both a high average score and a high minimum score.  Set the parameters however you like (3 and 8 are reasonable defaults) and check the crosstab to see each zip code's ratings over the four-year period.

<center>![Shiny: Safety Index Data](../03 Visualizations/SCross1A.png)</center> 
<br>
<center>![Shiny: Safety Index Crosstab](../03 Visualizations/SCross1B.png)</center>

####*Lowest Inspection Scores*
This visualization allows the user to set an arbitrary cutoff point as an aid for identifying low inspection scores.  The default value, 70, is the cutoff for restaurant inspections; lower scores will trigger a reinspection.  After selecting a parameter using the slider and requesting data, the user can select the "Crosstab" tab to see which zip codes and years had unacceptably low inspection scores.

<center>![Shiny: Lowest Inspection Scores Data](../03 Visualizations/SCross2A.png)</center>
<br>
<center>![Shiny: Lowest Inspection Scores Crosstab](../03 Visualizations/SCross2B.png)</center>

####*People per Restaurant*
Austin is a fast-growing city, and our restaurant scene is currently booming as well.  This visualization allows a user to see where the density of restaurants compared to population is highest.  Using a public population dataset generated by the US Census, we can find the number of people per restaurant in a given zip code; as more restaurants are added (or fail), the people-per-restaurant number changes.  The default values highlight the density of restaurants in 78701, which is downtown Austin.

<center>![Shiny: People per Restaurant Data](../03 Visualizations/SCross3A.png)</center> 
<br>
<center>![Shiny: People per Restaurant Crosstab](../03 Visualizations/SCross3B.png)</center>

###**Barchart Visualizations using Shiny in R**
These visualizations are also accessible at https://kvaughn.shinyapps.io/restaurant_inspection/

####*Scores by Zip Code*
Barcharts can be a good way to quickly identify trends visually.  In this visualization, the user can create a set of zip codes of interest, then see each zipcode's average score per year over the three or four years contained in the dataset.

<center>![Shiny: Scores by Zip Code Data](../03 Visualizations/SBar1A.png)</center> 
<br>
<center>![Shiny: Scores by Zip Code Bar Chart](../03 Visualizations/SBar1B.png)</center>

####*Transit stops and restaurant density*
Capital Metro, the Austin-area transit service, has made its data freely available for public download.  By importing this data into geographic information software (ArcGIS) and overlaying the coordinates of transit stops on a shapefile of regional zip codes, we were able to count the number of transit stops per zip code.  In this bar chart, we're examining the relationship between transit density and restaurant density.  Use the slider to select zip codes with more than the minimum number of transit stops, then compare the number of restaurants in these zip codes to the average number of restaurants per zip code throughout the region.  The blue line is the calculated average for all of the selected zip codes, and the red is the grand average for all of the zipcodes in the Greater Austin area.

<center>![Shiny: Restaurant Density and Transit Stops Data](../03 Visualizations/SBar2A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Transit Stops Bar Chart](../03 Visualizations/SBar2B.png)</center>

####*Nonnative residents and restaurant density*
Would we expect zip codes with comparatively large numbers of nonnative (naturalized-citizen or noncitizen) residents to have more or fewer restaurants, on average, than zip codes with lower percentages of nonnative residents?  This barchart allows the user to select a minimum percentage of nonnative residents per zip code, then shows the number of restaurants in that zip code and the average number of restaurants per zip code for all of the qualifying zip codes.  This average can be compared to the grand average throughout the greater Austin area, shown in red.

<center>![Shiny: Restaurant Density and Nonnative Residents Data](../03 Visualizations/SBar3A.png)</center> 
<br>
<center>![Shiny: Restaurant Density and Nonnative Residents Bar Chart](../03 Visualizations/SBar3B.png)</center>

>>>>>>> fd5d49a0e53324f45c1a48a192d19a4d73915bca
>>>>>>> 4132fa34d178208600e7c040ceb934c37440208b